The goal of this study is to investigate the integration of semantic and pragmatic information during word learning in children between 2 and 5 years of age. As a first step, we replicate earlier work showing that children rely on different forms of pragmatic and semantic information during word learning. In Experiment 1, we show that children make a so-called mutual exclusivity inference and that this inference depends on children’s developing semantic knowledge. In Experiment 2, we show that children make inferences about word meanings based on common ground. In Experiment 3, we find that, when the two inferences are combined in one procedure, children are sensitive to the way in which they are aligned.
Next, we introduce a computational framework that we use to formalize how the inferences are integrated. As part of this, we identify three information sources that children consider when making these inferences: semantic knowledge, expectations about speaker informativeness and sensitivity to common ground. We then use our modelling framework to ask which of these information sources are necessary to predict children’s responses in Experiment 3.
In the final section, we turn to the process by which information is integrated. We contrast the process of rational Bayesian inference we introduced in our model with a biased integration process in which one type of inference is given more weight. As part of this, we also explore alternative ways to think about developmental change in the integration process.
The first experiment tested the so-called mutual exclusivity inference in children between 2 and 5 years of age. The general phenomenon is that, when presented with a familiar and an unfamiliar object, children expect a novel word to refer to the unfamiliar object (e.g. Markman and Wachtel 1988). A range of explanations have been put forward for the cognitive basis of this inference (see Lewis et al. 2020 for a discussion). Here, we treat the mutual exclusivity inference as pragmatic (e.g. Clark 1987). The inference process is specified in the model below.
The first goal of this experiment was to quantify developmental change in the age range tested. The second goal was to test the role of semantic knowledge (cf. Lewis et al. 2020). The assumption is that the strength of the mutual exclusivity inference varies with knowledge of the word for the familiar object. That is, when the familiar object is one for which children are less likely to know the word, they are less likely to assume that the novel word refers to the unfamiliar object. To test this, we systematically varied the familiar objects that were presented alongside the novel object.
The experiment was preregistered at https://osf.io/gy37b. The experiment itself can be run by downloading the associated repository and opening the file experiments/kids/kids_me.html.
We tested a total of 90 children, including 30 2-year-olds (range = 2.03 - 3.00, 15 girls), 30 3-year-olds (range = 3.03 - 3.97, 22 girls) and 30 4-year-olds (range = 4.03 - 4.90, 16 girls). Data from 10 additional children were not included because the children were exposed to less than 75% English at home (5), they did not finish at least half of the test trials (2), the technical equipment failed (2) or their parents reported an autism spectrum disorder (1). All children were recruited from the floor of a Children’s museum in San José, California, USA. This population is characterized by diverse ethnic background (predominantly White, Asian, or mixed ethnicity) and high levels of parental education and socioeconomic status. Parents consented to their children’s participation and provided demographic information. All experiments were approved by the Stanford Institutional Review Board (protocol no. 19960).
The experiment was presented as an interactive picture book on a tablet computer (Frank et al. 2016). Figure 1A shows the general setup. Children saw an animal standing on a little hill between two tables. For each animal character, we recorded a set of utterances (one native English speaker per animal) that were used to make requests. Each experiment started with two training trials in which the speaker requested known objects (car and ball).
In Experiment 1, there was a familiar object on one table and a novel object (drawn for the purpose of the study) on the other. The speaker requested an object by saying “Oh cool, there is a [non-word] on the table, how neat, can you give me the [non-word]?”. Children responded by touching one of the objects. The location of the novel object (left or right table) and the animal character were counterbalanced. Each child received 12 trials, one with each familiar object. The novel object also changed from trial to trial. We coded a response as correct if the child chose the novel object as the referent of the novel word.
Figure 1: Schematic experimental procedure with screenshots from the experiments.
Each child completed 12 trials, each with a different familiar and a different novel object. Familiar objects were selected to vary in how likely children were to know the word for each object. This included objects that most 2-year-olds could name (e.g. a duck) as well as objects that only very few 5-year-olds could name (e.g. a pawn). The selection was based on age of acquisition ratings from Kuperman and colleagues (2012). While these ratings do not capture the absolute age at which children acquire these words, they capture the relative order in which words are learned. Figure 2A shows the objects used in the experiment. We induced this variation to estimate the role of semantic knowledge in the mutual exclusivity inference.
# Aggregate to one proportion per participant, then test each age group against chance (0.5)
chance_me <- me_data %>%
  group_by(subage, subid) %>%
  summarise(correct = mean(correct)) %>%
  summarise(correct = list(correct)) %>%
  group_by(subage) %>%
  mutate(
    Mean = round(mean(unlist(correct)), 2),
    BayesFactor = format(round(extractBF(ttestBF(unlist(correct), mu = 0.5))$bf), scientific = FALSE),
    `Age group` = subage
  ) %>%
  ungroup() %>%
  select(`Age group`, Mean, BayesFactor)
knitr::kable(chance_me, caption = "Proportion of children choosing the novel object compared to a level expected by chance based on a one sample Bayesian t-test. Responses are aggregated for each participant across familiar objects.", digits = 2, align = "l")
| Age group | Mean | BayesFactor |
|---|---|---|
| 2 | 0.61 | 132 |
| 3 | 0.73 | 185881356 |
| 4 | 0.86 | 72514087738 |
As a first step, we evaluated whether children made a mutual exclusivity inference. For this analysis, we aggregated participants’ responses across familiar objects. We used the function ttestBF from the R-package BayesFactor (Morey and Rouder 2018) to compute a Bayes factor (BF) in favor of the hypothesis that children chose the novel object more often than expected by chance (50% correct). Table 1 shows that all age groups made the inference.
# prior_me <- c(prior(normal(0, 5), class = Intercept),
# prior(normal(0, 5), class = b),
# prior(cauchy(0, 1), class = sd))
#
#
# bm_me <- brm(correct ~ age + (1|subid) + (age | item) + (age | agent),
# data = me_data, family = bernoulli(),
# control = list(adapt_delta = 0.99, max_treedepth = 20),
# sample_prior = F,
# prior = prior_me,
# cores = 4,
# chains = 4,
# iter = 5000)%>%
# saveRDS(.,"../saves/bm_me.rds")
#
#
# bm_me2 <- brm(correct ~ age + (1|subid) + (age | agent),
# data = me_data, family = bernoulli(),
# control = list(adapt_delta = 0.99, max_treedepth = 20),
# sample_prior = F,
# prior = prior_me,
# cores = 4,
# chains = 4,
# iter = 5000)%>%
# saveRDS(.,"../saves/bm_me2.rds")
bm_me <- readRDS("../saves/bm_me.rds")
bm_me2 <- readRDS("../saves/bm_me2.rds")
As a second step, we investigated how the inference changed as a function of age and the familiar object. We modeled the trial by trial data using a Bayesian generalized linear mixed model (GLMM). We used the function brm from the package brms (Bürkner 2017). We pre-registered the use of default priors in all models. However, the model in Experiment 3 was unable to initialize with default priors and we thus used weakly informative priors for all models to be consistent. The priors we used were normal(0,5) for fixed effects and cauchy(0,1) for standard deviations of random effects. The model formula was correct ~ age + (1 | id) + (age | object) + (age | agent). That is, we modeled an overall slope for age (continuous, anchored at the minimum) and the object specific developmental trajectories as deviations from the overall intercept and slope (random effects). We did not pre-register agent as a random effect, but retrospectively included it to be consistent with Experiments 2 and 3.
The estimate for age was positive and reliably different from zero (\(\beta\) = 0.91, 95% CrI: 0.58 - 1.3). Older children were more likely to make a mutual exclusivity inference. To assess the variability across objects, we compared the fit of the above model to a model lacking object as a random effect. Following McElreath (2016), we compared models using WAIC (widely applicable information criterion) scores and weights. The WAIC score is an indicator of the model’s predictive accuracy for out of sample data; models with lower scores are preferred. WAIC weights are an estimate of the probability that a given model (compared to all other models considered) will make the best predictions on new data. The model including object provided a much better fit compared to the model lacking it (see Table 2). Figure 2B visualizes the model based developmental trajectory for each familiar object and illustrates the substantial variation between them, both in terms of the absolute strength of the inference and its developmental trajectory. Figure 2C shows the correlation between rated age of acquisition and object specific model intercept. The mutual exclusivity effect was stronger for words that were rated to be acquired earlier. Objects for which children were less likely to know the word produced a weaker mutual exclusivity effect. Taken together, the strength of the mutual exclusivity inference depended on age as well as the familiar object.
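As a reading aid for the logit-scale estimate, the slope can be converted into an odds ratio. The sketch below assumes that age is coded in years, which is an assumption on our part because the unit is not stated in this section. Under that assumption, each additional year multiplies the odds of choosing the novel object by roughly 2.5, holding the random effects fixed:

\[ \frac{\mathrm{odds}(\mathrm{age} + 1)}{\mathrm{odds}(\mathrm{age})} = e^{\beta} = e^{0.91} \approx 2.48 \]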
me_waic <- brms::waic(bm_me, bm_me2, compare = F)
me_weights <- model_weights(bm_me, bm_me2, weights = "waic")
me_comp <- tibble(
  Model = c("with object as RE", "without object as RE"),
  WAIC = round(c(me_waic$loos$bm_me$estimates["waic", "Estimate"],
                 me_waic$loos$bm_me2$estimates["waic", "Estimate"]), 2),
  SE = round(c(me_waic$loos$bm_me$estimates["waic", "SE"],
               me_waic$loos$bm_me2$estimates["waic", "SE"]), 2),
  weight = round(c(me_weights[1], me_weights[2]))
)
knitr::kable(me_comp, caption = "Model comparison in experiment 1 based on WAIC scores and weights.", digits = 2, align = "l")
| Model | WAIC | SE | weight |
|---|---|---|---|
| with object as RE | 1089.05 | 32.19 | 1 |
| without object as RE | 1202.46 | 31.28 | 0 |
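To make the relation between WAIC scores and weights concrete, the following base R sketch recomputes the weights from the two WAIC scores in Table 2. It assumes the standard Akaike-type formula described by McElreath (2016); the helper name `waic_weights` is hypothetical and is not part of brms.

```r
# Akaike-type weights from WAIC scores (assumed formula, see McElreath 2016):
# w_i = exp(-0.5 * dWAIC_i) / sum_j exp(-0.5 * dWAIC_j)
waic_weights <- function(waic) {
  delta <- waic - min(waic)   # difference to the best (lowest) score
  rel <- exp(-0.5 * delta)    # relative support for each model
  rel / sum(rel)              # normalise so that the weights sum to 1
}

# WAIC scores as reported in Table 2
waic_exp1 <- c(with_object = 1089.05, without_object = 1202.46)
weights_exp1 <- waic_weights(waic_exp1)
print(round(weights_exp1, 2))
```

With a difference of more than 100 WAIC points, essentially all of the weight falls on the model that includes object as a random effect, which matches the rounded weights of 1 and 0 in the table.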
Figure 2: A: Familiar words and corresponding pictures by rated age of acquisition. B: Developmental trajectories of the mutual exclusivity effect by familiar object based on the mean of the model posterior distribution. Dots show individual datapoints. Lighter colors indicate later rated age of acquisition. Dotted line indicates a level of performance expected by chance. C: Correlation between rated age of acquisition and mutual exclusivity effect (model based intercept for each familiar object).
Here we tested children’s sensitivity to common ground that is built up over the course of a conversation. In particular, we tested whether children keep track of which object is new to a speaker and which they have encountered previously (Akhtar, Carpenter, and Tomasello 1996; Diesendruck et al. 2004). The main goal of the experiment was to measure how children’s sensitivity to common ground changes with age.
The experiment was preregistered at https://osf.io/au5hr. The experiment itself can be run by downloading the associated repository and opening the file experiments/kids/kids_novel.html.
We tested 58 children from the same general population as in Experiment 1, including 18 2-year-olds (range = 2.02 - 2.93, 7 girls), 19 3-year-olds (range = 3.01 - 3.90, 14 girls) and 21 4-year-olds (range = 4.07 - 4.93, 14 girls). Data from 5 additional children was not included because they were either exposed to less than 75% of English at home (3) or the technical equipment failed (2).
The general setup was the same as in Experiment 1. The speaker was positioned between the tables. There was a novel object (drawn for the purpose of the study) on one of the tables while the other table was empty. Next, the speaker turned to one of the tables and either commented on the presence (“Aha, look at that.”) or the absence (“Hm, nothing there”) of an object. Then the speaker disappeared. While the speaker was away, a second novel object appeared on the previously empty table. Then the speaker returned and requested an object in the same way as in Experiment 1 (see also Figure 1B). The positioning of the novel object at the beginning of the experiment and the location the speaker turned to first were counterbalanced. Children received five trials, each with a different pair of novel objects. We coded a response as correct if the child chose the object that was new to the speaker as the referent of the novel word.
# Aggregate to one proportion per participant, then test each age group against chance (0.5)
chance_prior <- prior_data %>%
  group_by(subage, subid) %>%
  summarise(correct = mean(correct)) %>%
  summarise(correct = list(correct)) %>%
  group_by(subage) %>%
  mutate(
    Mean = round(mean(unlist(correct)), 2),
    BayesFactor = format(round(extractBF(ttestBF(unlist(correct), mu = 0.5))$bf, 2), scientific = FALSE),
    `Age group` = subage
  ) %>%
  ungroup() %>%
  select(`Age group`, Mean, BayesFactor)
knitr::kable(chance_prior, caption = "Proportion of children choosing the object that was new to the speaker compared to a level expected by chance based on a one sample Bayesian t-test. Responses are aggregated for each participant across trials.", digits = 2, align = "l")
| Age group | Mean | BayesFactor |
|---|---|---|
| 2 | 0.55 | 0.4 |
| 3 | 0.76 | 26.55 |
| 4 | 0.83 | 6956.06 |
Table 3 compares children’s correct responses to a level expected by chance (50%). We found evidence that, as a group, 3- and 4-year-olds, but not 2-year-olds, inferred that the novel word referred to the object that was new to the speaker.
# prior_cg <- c(prior(normal(0, 5), class = Intercept),
# prior(normal(0, 5), class = b),
# prior(cauchy(0, 1), class = sd))
#
#
# bm_cg <- brm(correct ~ age + (1|subid) + (age | agent),
# data = prior_data, family = bernoulli(),
# control = list(adapt_delta = 0.99, max_treedepth = 20),
# sample_prior = F,
# prior = prior_cg,
# cores = 4,
# chains = 4,
# iter = 5000)%>%
# saveRDS(.,"../saves/bm_cg.rds")
bm_cg <- readRDS("../saves/bm_cg.rds")
To directly investigate whether children’s responses changed with age, we modeled the trial by trial data using a Bayesian GLMM (formula: correct ~ age + (1 | id) + (age | agent); for specifications see Experiment 1). The estimate for age was positive and reliably different from zero (\(\beta\) = 0.92, 95% CrI: 0.37 - 1.54, see Figure 3A). Older children were more likely to choose the object that was new to the speaker as the referent of the novel word, suggesting that the sensitivity to common ground in this context increases with age.
Experiment 3 combined the procedures from Experiments 1 and 2. As a consequence, children had to consider not just their semantic knowledge of the word for the familiar object and the inference this licenses but also the role that each object (novel and familiar) had played in the preceding interaction. Combining the two procedures created two conditions: In the congruent condition, the novel object was also the object that was new to the speaker. In this case, the mutual exclusivity inference as well as the common ground inference pointed to the novel object as the referent. In the incongruent condition, the familiar object was new to the speaker. In this case, the two inferences pointed to different objects. The main focus of the overall study was to model how children integrate and balance these different information sources. We investigate this question in depth in the modelling section below. Here, we limit the discussion to whether children differentiated between the two conditions.
The experiment was preregistered at https://osf.io/4nm8g. The experiment itself can be run by downloading the associated repository and opening the file experiments/kids/kids_combination.html.
We tested 220 children from the same general population as in Experiment 1 and 2, including 76 2-year-olds (range = 2.04 - 2.99, 7 girls), 72 3-year-olds (range = 3.00 - 3.98, 14 girls) and 72 4-year-olds (range = 4.00 - 4.94, 14 girls). Data from 20 additional children was not included because they were either exposed to less than 75% of English at home (15), did not finish at least half of the test trials (3) or the technical equipment failed (2).
Experiment 3 followed the same procedure as Experiment 2 but involved the same objects as Experiment 1. In the beginning, one table was empty while there was an object (novel or familiar) on the other one. After commenting on the presence or absence of an object on each table, the speaker disappeared and a second object appeared (familiar or novel). Next, the speaker re-appeared and made the usual request.
In the congruent condition, the familiar object was present in the beginning and the novel object appeared while the speaker was away (Figure 1C - left). In this case, both the mutual exclusivity and the common ground inference pointed to the novel object as the referent. In the incongruent condition, the novel object was present in the beginning and the familiar object appeared later. In this case, the two inferences pointed to different objects (Figure 1C - right).
Participants received up to 12 test trials, six in each condition, each with a different familiar and novel object. Familiar objects were the same as in Experiment 1. The positioning of the objects on the tables and the location the speaker first turned to were counterbalanced. Participants could stop the experiment after six trials (three per condition). If a participant stopped after half of the trials, we tested an additional participant to reach a pre-registered number of data points per cell.
All results are reported from the perspective of the mutual exclusivity inference (correct in the model formula below). In the incongruent condition, high proportions indicate a mutual exclusivity inference and low proportions a common ground inference. In the congruent condition, both inferences pointed in the same direction. The focus of this experiment was on information integration and we therefore did not compare the performance to chance.
# prior_comb <- c(prior(normal(0, 5), class = Intercept),
# prior(normal(0, 5), class = b),
# prior(cauchy(0, 1), class = sd))
#
# bm_comb <- brm(correct ~ age * alignment + (alignment | subid) + (age * alignment | item)+ (age * alignment | agent),
# data = comb_data, family = bernoulli(),
# control = list(adapt_delta = 0.99, max_treedepth = 20),
# sample_prior = F,
# prior = prior_comb,
# cores = 4,
# chains = 4,
# inits = 0,
# iter = 5000)%>%
# saveRDS(.,"../saves/bm_comb.rds")
#
# bm_comb2 <- brm(correct ~ age * alignment + (alignment | subid) + (age * alignment | agent),
# data = comb_data, family = bernoulli(),
# control = list(adapt_delta = 0.99, max_treedepth = 20),
# sample_prior = F,
# prior = prior_comb,
# cores = 4,
# chains = 4,
# inits = 0,
# iter = 5000)%>%
# saveRDS(.,"../saves/bm_comb2.rds")
bm_comb <- readRDS("../saves/bm_comb.rds")
bm_comb2 <- readRDS("../saves/bm_comb2.rds")
We modeled the trial by trial data in the following way: correct ~ age * alignment + (alignment | subid) + (age * alignment | object) + (age * alignment | agent). We pre-registered to include item as a fixed effect in Experiment 3. The corresponding model was too complex to be constrained by the data. Furthermore, as explained in Experiment 1, items were chosen based on their rated age of acquisition. That is, we assumed that they are not necessarily different kinds but that they represent different locations on a distribution of required semantic knowledge. For further model specifications see Experiment 1.
The estimate for age was reliably positive (\(\beta\) = 0.81, 95% CrI: 0.4 - 1.24). The incongruent condition had a strong negative impact (\(\beta\) = -1.35, 95% CrI: -2.17 - -0.55), showing that children differentiated between the two conditions. The interaction term was weakly - though not entirely - negative, suggesting a shallower slope for age in the incongruent condition (\(\beta\) = -0.2, 95% CrI: -0.66 - 0.27). A model lacking object as a random effect provided a much poorer fit, suggesting substantial variation across objects (see Table 4). Figure 3B visualizes the model. Taken together, the results show that children responded to the way the two inferences were aligned with one another.
comb_waic <- brms::waic(bm_comb, bm_comb2, compare = F)
comb_weights <- model_weights(bm_comb, bm_comb2, weights = "waic")
comb_comp <- tibble(
  Model = c("with object as RE", "without object as RE"),
  WAIC = round(c(comb_waic$loos$bm_comb$estimates["waic", "Estimate"],
                 comb_waic$loos$bm_comb2$estimates["waic", "Estimate"]), 2),
  SE = round(c(comb_waic$loos$bm_comb$estimates["waic", "SE"],
               comb_waic$loos$bm_comb2$estimates["waic", "SE"]), 2),
  weight = round(c(comb_weights[1], comb_weights[2]))
)
knitr::kable(comb_comp, caption = "Model comparison in experiment 3 based on WAIC scores and weights.", digits = 2, align = "l")
| Model | WAIC | SE | weight |
|---|---|---|---|
| with object as RE | 2188.59 | 46.97 | 1 |
| without object as RE | 2390.01 | 44.28 | 0 |
Figure 3: Proportion of choosing the object that was new to the speaker by age. Dots show the mean response for each participant. The solid black line shows the developmental trajectory based on the mean of the model posterior distribution. Lighter lines show 200 random draws from the posterior distribution to depict uncertainty. Dotted line indicates a level of performance expected by chance.
The experiments reported above show that children are sensitive to the types of information sources we intended to manipulate. Experiment 1 showed that children of all age groups make a mutual exclusivity inference, that the strength of this inference increases with age and, crucially, that it depends on the type of familiar object that is presented. Experiment 2 showed that children are sensitive to the common ground manipulation we implemented and that this inference increases with age. Finally, Experiment 3 showed that children respond differently depending on how the mutual exclusivity inference and the common ground inference are aligned with one another. In the next section, we use Bayesian cognitive models to address the question of how information sources are integrated when inferences are combined with one another.
The main purpose of the study was to investigate how children integrate different information sources during word learning and how this process develops with age. To do so, we use Bayesian cognitive models of pragmatic reasoning. We first describe an integration model which we think best represents the inference and integration processes and then specify how this model captures developmental change. Next, we ask how well this model predicts how children integrate information. That is, in a situation in which we know the developmental trajectories for the mutual exclusivity inference (for a particular familiar object) as well as for the common ground inference, what can we say about what happens when they are combined? We then test the predictive power of the model by comparing the model predictions to the data from Experiment 3. We use formal model comparison methods to test the integration model against a range of alternative models.
Finally, we ask how well our model explains the way that children integrate the different information sources. For this analysis, we fit the free parameters in the model to all the available data, those from experiment 1 and 2 as well as the integration data from experiment 3. We then compare the model to a range of alternative models that make different assumptions about how information is integrated and how this process develops. This approach answers the question of how we can best explain how children integrate the different information sources.
The cognitive models are situated in the Rational Speech Act (RSA) framework (Frank and Goodman 2012; Goodman and Frank 2016). RSA models are models of pragmatic reasoning in that they treat language understanding as a special case of Bayesian social reasoning. A listener interprets an utterance by assuming it was produced by a cooperative speaker who had the goal to be informative. Being informative is defined as providing a message that would increase the probability of the listener inferring the speaker’s intended message. This notion of contextual informativeness captures the Gricean idea of cooperation between speaker and listener.
\[ \begin{aligned} P_{L}(r \mid u) &\propto P_{S}(u \mid r) \cdot P(r \mid \rho_i) \\ P_{S}(u \mid r) &\propto \mathrm{Informativity}(u; r)^{\alpha_i} \\ \mathrm{Informativity}(u; r) &= P(r \mid u) \propto P(r \mid \rho_i) \cdot \mathcal{L}(u \mid \theta_{ij}) \end{aligned} \]
Our model describes a listener reasoning about the referent referred to by a speaker’s utterance. This reasoning is contextualized by the prior probability of each referent \(P(r \mid \rho_i)\) . This prior probability is thought to be a function of the common ground \(\rho\) shared between speaker and listener in that interacting around the objects changes the probability that they will be referred to later. We assume that the degree to which interactions around objects are integrated into the common ground and thus change the prior probability of those objects depends on the child’s age \(i\).
To decide between referents, the listener reasons about what a rational speaker would say given an intended referent. This speaker is assumed to compute the informativity for each available utterance and then choose the most informative one. However, this expectation of speaker informativeness may vary and is captured by the parameter \(\alpha\). In particular, we take \(\alpha\) to be a function of the child’s age \(i\).
The informativity of each utterance is given by imagining which referent a literal listener, who interprets words according to their lexicon \(\mathcal{L}\), would infer upon hearing the utterance. Thus, this reasoning depends on what kind of semantic knowledge (word–object mappings) the speaker thinks the literal listener has. We parameterize the listener’s knowledge of a word’s semantics in terms of a semantic knowledge parameter \(\theta\), which varies between 0 and 1. \(\theta = 0\) corresponds to the state of knowledge for a completely novel word and results in a semantic interpretation function that chooses randomly between the objects in the scene. Each novel object is assumed to have semantic knowledge of 0. For \(\theta \in (0, 1)\), the semantic interpretation function will pick out the familiar referent with probability \(\theta + \frac{1 - \theta}{2} = \frac{1 + \theta}{2}\); that is, with probability \(\theta\), the listener knows the correct meaning of the word (and picks out the correct referent 100% of the time); with probability \(1 - \theta\), the listener does not know the meaning of the word and must guess, picking out the correct referent 50% of the time. For familiar objects, semantic knowledge is a function of the degree-of-acquisition of the associated word, which in turn depends upon the kind of object \(j\) (its expected acquisition trajectory) as well as on the child’s age \(i\).
The model description above points to three potential loci of developmental change: semantic knowledge, expectations about speaker informativeness and sensitivity to common ground. Each of these components is represented by a parameter that plays a particular functional role in the model. We capture developmental change by making these parameters a function of age and therefore estimating a developmental trajectory (intercept and slope) for each parameter.
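The three equations above can be traced through for a single trial with one familiar and one novel object. The following base R sketch is only an illustration and is not the WebPPL implementation used for the analyses. It assumes that the speaker chooses between exactly two utterances (the familiar word and the novel word); the function name `rsa_listener` and its arguments are hypothetical.

```r
# Minimal RSA sketch for one familiar and one novel object.
# Assumption: the speaker chooses between two utterances only,
# the familiar word and the novel word.
rsa_listener <- function(theta, alpha, prior_novel = 0.5) {
  prior <- c(familiar = 1 - prior_novel, novel = prior_novel)

  # Lexicon L(u | theta): rows are utterances, columns are referents.
  # The familiar word picks out the familiar object with probability
  # (1 + theta) / 2; the novel word (theta = 0) is uninformative.
  lexicon <- rbind(
    familiar_word = c(familiar = (1 + theta) / 2, novel = (1 - theta) / 2),
    novel_word    = c(familiar = 0.5, novel = 0.5)
  )

  # Informativity(u; r) = P(r | u), proportional to P(r | rho) * L(u | theta)
  informativity <- sweep(lexicon, 2, prior, `*`)
  informativity <- informativity / rowSums(informativity)

  # Speaker: P_S(u | r) proportional to Informativity(u; r)^alpha
  speaker <- informativity^alpha
  speaker <- sweep(speaker, 2, colSums(speaker), `/`)

  # Listener: P_L(r | u) proportional to P_S(u | r) * P(r | rho)
  listener <- sweep(speaker, 2, prior, `*`)
  listener / rowSums(listener)
}

# Probability of choosing the novel object after hearing the novel word
p_known <- rsa_listener(theta = 1, alpha = 1)["novel_word", "novel"]
p_unknown <- rsa_listener(theta = 0, alpha = 1)["novel_word", "novel"]
print(c(known = p_known, unknown = p_unknown))  # known = 0.75, unknown = 0.5
```

In this sketch, a fully known familiar word yields a mutual exclusivity inference (0.75 with \(\alpha = 1\) and a uniform prior), whereas an unknown familiar word leaves the listener at chance. Passing a non-uniform `prior_novel` shifts the result towards or away from the novel object, which is how the congruent and incongruent conditions would differ under these assumptions.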
Semantic knowledge captures the degree of certainty with which the naive listener is assumed to know the label for the familiar object. As a consequence, semantic knowledge differs between familiar objects. For objects whose labels are generally acquired earlier (e.g. carrot) semantic knowledge should generally be high whereas for others (e.g. pawn) semantic knowledge should generally be lower. However, semantic knowledge also varies with age such that older children are more likely to know the labels for more of the familiar objects compared to younger children. As a consequence, each familiar object has a unique developmental trajectory with respect to semantic knowledge. Technically, the object-specific parameters (\(\theta_{ij}\)) are estimated in the form of a hierarchical regression (mixed-effects) model; that is, each object’s trajectory is estimated as a deviation from an overall trajectory of vocabulary development. This overall (vocabulary) trajectory represents the development of semantic knowledge that is independent of a particular familiar object–label pairing (see Figure ??).
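The idea of object-specific trajectories as deviations from an overall trajectory can be sketched in a few lines of base R. All parameter values below are hypothetical and chosen only for illustration; the centring of age at 2 years and the use of the logistic function to keep \(\theta_{ij}\) between 0 and 1 are likewise assumptions of this sketch.

```r
# Hypothetical parameter values, chosen only for illustration.
overall <- c(intercept = 0, slope = 1)   # overall vocabulary trajectory
deviation <- list(                       # object-specific deviations
  duck = c(intercept = 2, slope = 0.2),
  pawn = c(intercept = -3, slope = -0.2)
)

# theta_ij: semantic knowledge of a child of age i for object j.
# Assumption: age is centred at 2 years and the linear predictor is
# mapped to the unit interval with the logistic function.
semantic_knowledge <- function(age, object) {
  dev <- deviation[[object]]
  logit <- (overall[["intercept"]] + dev[["intercept"]]) +
    (overall[["slope"]] + dev[["slope"]]) * (age - 2)
  plogis(logit)
}

ages <- c(2, 3, 4)
theta_duck <- semantic_knowledge(ages, "duck")
theta_pawn <- semantic_knowledge(ages, "pawn")
print(round(rbind(duck = theta_duck, pawn = theta_pawn), 2))
```

With these illustrative values, semantic knowledge increases with age for both objects but stays much lower for the later-acquired word throughout the age range.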
A second locus of developmental change is a listener’s expectations about speaker informativeness. In the context of the model, speaker informativeness corresponds to the degree to which the listener expects the speaker to choose the most informative of the available utterances. We assume that children at different ages could have different expectations about how rational or informative speakers are (see e.g. Bohn et al. 2019; Frank and Goodman 2014; Yoon and Frank 2019).
Sensitivity to common ground refers to the probability that an object is taken to be the referent of the utterance before actually hearing the utterance. Thus, it captures the salience of an object due to its role in the social interaction that precedes the utterance. We expect children at different ages to respond differently to the common ground manipulation, resulting in an age specific prior distribution over objects (Akhtar, Carpenter, and Tomasello 1996; Diesendruck et al. 2004).
All Bayesian cognitive models were implemented in the probabilistic programming language WebPPL (Goodman and Stuhlmüller 2014). The corresponding model code can be found in the associated online repository (file xxxxx). To generate model predictions, we estimated age sensitive parameter distributions for semantic knowledge (by familiar object), speaker informativeness and sensitivity to common ground and then passed them through the model in line with the different ways in which they can be combined and aligned. The resulting predictions come in the form of distributions of developmental trajectories for each object in the congruent and the incongruent condition.
We used the following prior distributions for model parameters. Intercept and slope for sensitivity to common ground: \(unif(-2,2)\). For speaker informativeness we used \(unif(-3,3)\) for the intercept and \(unif(0,4)\) for the slope. We restricted the slope to be positive because negative values for speaker informativeness are conceptually implausible. For semantic knowledge we used \(unif(-3,3)\) for the intercept and \(unif(0,2)\) for the slope, because it is implausible to assume that semantic knowledge decreases with age. For the parameters capturing the variability of the object specific trajectories around these overall parameters we used \(unif(0,2)\) for the intercept and \(unif(0,1)\) for the slope. Some choices regarding these prior distributions were made to ease model convergence. However, please note that all models considered used the same prior distributions.
In this section we evaluate different models in terms of how well they predict information integration. That is, in a situation in which we know the development of the mutual exclusivity inference as well as the common ground inference, we look at each model’s ability to predict what happens when the two are combined (combination data from Experiment 3). Investigating “pure” (or, a priori) prediction automatically excludes all models which include parameters that need to be fit to the combination data itself (e.g., a heuristic, non-integrating mixture model, described below). To generate a priori predictions, we independently estimated the model parameters for semantic knowledge and speaker informativeness based on Experiment 1 and the parameter for common ground sensitivity based on Experiment 2.
To estimate the parameters for semantic knowledge and speaker informativeness, we adapted the model described above to a situation in which both objects (novel and familiar) have equal prior probability (i.e., no common ground information). We used the data from Experiment 1 to then infer the parameters. That is, we inferred the intercepts and slopes for speaker informativeness (linear regression) and semantic knowledge (logistic regression) that generated RSA model predictions to match the responses generated in Experiment 1. To estimate the parameters representing sensitivity to common ground, we used a simple logistic regression to infer which combination of intercept and slope would generate predictions that corresponded to the average proportion of correct responses measured in Experiment 2.
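The regression structure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: age enters in years without centering or scaling (the text does not say how age was coded), and all function names and coefficient values are hypothetical.

```python
import math


def logistic(x):
    """Map a real number onto the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))


def speaker_informativeness(age, intercept, slope):
    # Linear regression: the linear predictor is used directly.
    return intercept + slope * age


def semantic_knowledge(age, intercept, slope):
    # Logistic regression: the linear predictor is squashed into (0, 1).
    return logistic(intercept + slope * age)


def common_ground_prior(age, intercept, slope):
    # Logistic regression: probability assigned to the object favored by
    # common ground, returned together with the probability of the other object.
    p = logistic(intercept + slope * age)
    return [p, 1.0 - p]


# Hypothetical coefficients, chosen only to show the shape of the trajectories.
for age in (2, 3, 4):
    print(age, round(semantic_knowledge(age, intercept=-2.0, slope=1.0), 3))
```

In the analysis itself, the intercepts and slopes are not fixed numbers as in this sketch but are inferred from the data of Experiments 1 and 2.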
To estimate the parameter distributions, we collected samples from six independent MCMC chains, collecting 150,000 samples from each chain and removing the first 50,000 for burn-in. We excluded samples from one chain because it got stuck on a local maximum which resulted in parameter distributions that were substantially different from the other chains. The model outputs can be found in the following online repository: git large file storage.
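The post-processing of the chains described above amounts to discarding the burn-in samples and pooling the retained chains. A minimal Python sketch, assuming each chain is a plain list of samples; the function name `pool_chains` and the toy chains are hypothetical:

```python
def pool_chains(chains, burn_in, excluded=()):
    """Drop the first `burn_in` samples of each chain, skip excluded chains, pool the rest."""
    pooled = []
    for index, chain in enumerate(chains):
        if index in excluded:
            continue  # e.g. a chain that got stuck and was removed by hand
        pooled.extend(chain[burn_in:])
    return pooled


# Toy example with six short chains. In the text, each chain has 150,000
# samples and the first 50,000 are removed as burn-in.
chains = [[float(c)] * 15 for c in range(6)]
kept = pool_chains(chains, burn_in=5, excluded={3})
print(len(kept))  # 5 retained chains x 10 retained samples = 50
```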
Next, we combined the parameters according to the four models described below. Note that the parameter distributions were the same for all models (see Figure 4) and that models only differed in terms of which parameters they included. The models described below are a full model (integration model) and three lesioned models, which selectively omit one type of information. The following model comparison therefore asks which types of information are necessary to make good predictions about how information is integrated. We do not compare models that make different assumptions about how information is integrated, since they require additional parameters specific to Experiment 3. We consider the question of alternative integration models in the explanation section.
Figure 4: Developmental trajectories for model parameters based on the posterior distribution for (A) semantic knowledge, (B) speaker informativeness and (C) prior sensitivity. Solid lines show the MAP estimate for each parameter. Lighter lines in (B) and (C) show 300 random draws from the posterior distribution to visualize uncertainty. (A) does not include these random draws for the sake of clarity.
The integration model serves as the full model and takes in all available information. That is, it takes in object-specific semantic knowledge, speaker informativeness and common ground sensitivity and combines these components by way of the process described above. Figure 5 visualizes the corresponding model predictions in comparison to the data from experiment 3.
Figure 5: Predicting information integration across development. Model predictions based on the integration model. Colored lines show developmental trajectories for each familiar object and condition based on 300 random draws from the model posterior distribution. Top row (blue) shows the congruent condition and the bottom row (red) shows the incongruent condition. Familiar objects are ordered based on their rated age of acquisition (left to right). Dashed black lines show smoothed conditional mean of the data with 95% CI (in grey). Light dots are individual data points.
The first lesioned model takes in speaker informativeness and common ground sensitivity as well as general semantic knowledge, but omits semantic knowledge that is specific to the familiar objects. We described above that the parameters for semantic knowledge are fitted via a hierarchical regression (mixed effects) model. In this model, there is an overall developmental trajectory for semantic knowledge (main effect) and then there is object-specific variation around this trajectory (random effects). The no word knowledge model takes in the overall trajectory for semantic knowledge but ignores object-specific variation. That is, the model assumes a listener whose mutual exclusivity inference does not vary depending on the particular familiar object but only depends on the average semantic knowledge.
The no common ground model takes in object-specific semantic knowledge and speaker informativeness but ignores common ground. Thus, the prior distribution over objects in the model described above is uniform (e.g., [0.5, 0.5]). This corresponds to a listener who only focuses on the mutual exclusivity inference and ignores the common ground manipulation. As a consequence, the listener does not differentiate between the two common ground alignment conditions.
The last lesioned model, the no mutual exclusivity model, only takes in common ground sensitivity. This corresponds to a listener who only focuses on common ground and ignores the identity of the objects on the tables as well as any inferences that their semantic knowledge of the familiar objects licenses. The model predictions therefore correspond to the prior distribution over objects \(P(r \mid \rho_i)\).
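The relation between the full model and the lesioned models can be sketched in a few lines. This Python illustration makes simplifying assumptions: the speaker likelihood for the novel word is passed in as a ready-made pair of numbers rather than derived from semantic knowledge and speaker informativeness, and the function name `predict`, the model labels and all numeric values are hypothetical. The no word knowledge model would use the same computation as the integration model, with a likelihood based on average rather than object-specific semantic knowledge.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def predict(model, speaker_likelihood, common_ground_prior):
    """Listener's distribution over [novel object, familiar object] for one model variant."""
    uniform = [0.5, 0.5]
    if model == "integration":
        scores = [l * p for l, p in zip(speaker_likelihood, common_ground_prior)]
    elif model == "no_common_ground":
        # Common ground is ignored: the prior over objects is uniform.
        scores = [l * p for l, p in zip(speaker_likelihood, uniform)]
    elif model == "no_mutual_exclusivity":
        # Only the common ground prior is used.
        scores = list(common_ground_prior)
    else:
        raise ValueError(f"unknown model: {model}")
    return normalize(scores)


likelihood = [0.8, 0.2]   # hypothetical: the novel word favors the novel object
congruent = [0.7, 0.3]    # hypothetical: common ground also favors the novel object
incongruent = [0.3, 0.7]  # hypothetical: common ground favors the familiar object

for model in ("integration", "no_common_ground", "no_mutual_exclusivity"):
    print(model, predict(model, likelihood, congruent), predict(model, likelihood, incongruent))
```

In this sketch, only the integration model produces predictions that depend both on the likelihood and on the alignment condition.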
We compared the models mentioned above in two ways. First, we used correlations between model predictions and the data. For this analysis, we binned the model predictions and the data by age in years and familiar object. Figure 6 visualizes the correlation between model predictions and the data for all models. The results show a very high correlation between the predictions of the integration model and the data in all age groups, suggesting that the model accurately captures the variation in the data. Correlations for the integration model were also higher than those for the other models considered. The correlation increased from 2- to 3-year-olds but then dropped again for 4-year-olds. The mismatch for the oldest age group is probably a consequence of the model making very extreme predictions in the congruent condition. This results from the fact that 4-year-olds show very high performance in the mutual exclusivity as well as the common ground task. When combined in the model, the two inferences amplify one another because we do not assume that there is any cost of integration. We made this choice because the parameter representing integration cost would have to be estimated based on the integration data itself.
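The binned correlation analysis can be sketched as follows. This is a minimal Python illustration, not the analysis code from the repository; the row format, the function names and the toy rows are assumptions, and Pearson's correlation is used here as one plausible choice of correlation measure.

```python
import math
from collections import defaultdict


def bin_means(rows):
    """Average values within each (age in years, object, condition) bin.

    rows: iterable of (age, familiar_object, condition, value) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for age, obj, condition, value in rows:
        key = (math.floor(age), obj, condition)
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def binned_correlation(data_rows, prediction_rows):
    """Correlate binned model predictions with binned data over shared bins."""
    data, pred = bin_means(data_rows), bin_means(prediction_rows)
    keys = sorted(set(data) & set(pred))
    return pearson([pred[k] for k in keys], [data[k] for k in keys])


# Toy rows (hypothetical): binary responses and model-predicted probabilities.
data_rows = [(2.3, "duck", "congruent", 1), (2.9, "duck", "congruent", 0),
             (3.4, "duck", "congruent", 1), (4.1, "duck", "congruent", 1)]
prediction_rows = [(2.5, "duck", "congruent", 0.6), (3.5, "duck", "congruent", 0.8),
                   (4.5, "duck", "congruent", 0.9)]
print(binned_correlation(data_rows, prediction_rows))
```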
We additionally compared models based on the marginal likelihood of the data under each model, that is, the likelihood of the data averaged over (“marginalizing over”) the prior distribution on parameters. The pairwise ratio of marginal likelihoods for two models is known as the Bayes factor (see file model_comparison.Rmd in the associated online repository). Bayes factors quantify the quality of a model’s predictions, averaging over the possible values of the model’s parameters (weighted by their prior probability). By averaging over the prior distribution on parameters, Bayes factors implicitly take model complexity into account: models with more parameters tend to have a broader prior distribution over parameters, which, in effect, can water down the potentially better predictions that a model with more parameters can achieve (Lee and Wagenmakers 2014). For this analysis, we treated age continuously. Table 5 lists the Bayes factors for the different model comparisons. The results show that the integration model, by far, outperformed all the other models. When comparing the lesioned models with each other, we see that models including the mutual exclusivity inference make better predictions than the no mutual exclusivity model.
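To make the logic of marginal likelihoods and Bayes factors concrete, consider a toy example that is unrelated to the actual models: a flexible model with one free success probability under a uniform prior is compared with a model that fixes the probability at 0.5. The Python sketch below approximates the marginal likelihood by averaging the likelihood over a grid of prior values; this grid approximation, the function names and the data (7 successes in 10 trials) are illustrative assumptions and not the procedure used in model_comparison.Rmd.

```python
import math


def binomial_likelihood(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def marginal_likelihood(k, n, prior_values):
    """Average the likelihood over equally weighted values representing the prior."""
    return sum(binomial_likelihood(k, n, p) for p in prior_values) / len(prior_values)


# Midpoint grid approximating a unif(0, 1) prior on the success probability.
grid = [(i + 0.5) / 1000 for i in range(1000)]

k, n = 7, 10  # hypothetical data
m_flexible = marginal_likelihood(k, n, grid)  # analytically 1 / (n + 1)
m_fixed = binomial_likelihood(k, n, 0.5)      # no free parameter to average over
bayes_factor = m_flexible / m_fixed
print(m_flexible, m_fixed, bayes_factor)
```

Although the flexible model can fit these data better at its best parameter value, its marginal likelihood is lower than that of the fixed model, because much of its prior mass lies on parameter values that predict the data poorly. This is the sense in which Bayes factors penalize complexity.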
Taken together, these analyses showed two things. First, the integration model makes accurate predictions about how mutual exclusivity and common ground inferences are integrated. It does so based on knowing the strength and development of each inference alone, and incorporating them into a structured probabilistic model of pragmatic reasoning. Second, models that omit one or more types of information (object-specific word knowledge, speaker informativeness, common ground sensitivity) make appreciably worse predictions. This result suggests that children across the entire age range flexibly integrate all the available information. In the next section, we ask whether there are other ways to think about the process of information integration than the way formalized in the integration model.
Figure 6: Predicting information integration. Correlations between model predictions and data binned by year, item and condition. Vertical and horizontal error bars show 95% HDI. Blue diamonds show congruent condition and red ones show the incongruent condition.
| Model comparison | Bayes factor |
|---|---|
| integration vs no word knowledge | 124858235890685 |
| integration vs no common ground | 699235921164959 |
| integration vs no mutual exclusivity | 78825559528126053435259473126784820576256 |
| no word knowledge vs no mutual exclusivity | 6 |
| no word knowledge vs no common ground | 631320464892188619261870080 |
| no mutual exclusivity vs no common ground | 112730992705293381995069440 |
In this section, we explore how best to explain information integration across development by considering alternative ways to think about the integration process. The integration model outlined above operates via Bayesian inference: the prior probability of a referent (a consequence of the common ground manipulation) is updated via the likelihood of a speaker producing the utterance that was heard, which is what gives rise to the mutual exclusivity inference. In essence, this model is a multiplicative model because the posterior probability of each referent given the utterance is proportional to the product of the likelihood of a speaker saying that utterance and the prior probability of that referent.
Information sources need not be integrated in a Bayesian manner. Rather than being part of the same pragmatic reasoning architecture, the common ground and mutual exclusivity inferences we consider could be computed separately and then combined. A listener could integrate these two independent inferences in an additive manner, weighting the sources of information by some ratio \(\phi\). \(\phi\) is then a bias to prefer one type of inference relative to the other. We formalize this alternative hypothesis as a mixture model.
The integration and mixture models have potentially different implications for developmental change. The integration model assumes that the process by which information is integrated remains constant across development. What changes is children’s semantic knowledge, their expectations about speaker informativeness, and their sensitivity to common ground, but the way these information sources are combined remains the same across age. The mixture model alternatively posits that there is some bias or preference for one information source, and this bias could change across development. This developmental mixture model makes the same assumptions about developmental change in the individual information sources as the integration model but, in addition, assumes that the way that the mutual exclusivity and the common ground inference are combined with one another changes over time. This model is structurally identical to the mixture model but the mixture parameter \(\phi\) is a function of age (i.e., represented by an intercept and a slope).
In this section, we make use of all the available data to arbitrate between these alternative models (a fully Bayesian analysis). As before, Experiments 1 and 2 directly constrain the parameters governing semantic knowledge, speaker informativeness, and the prior distribution over referents (via common ground; Expt. 2). Now, in addition, we incorporate the data from Experiment 3 to additionally constrain these parameters as well as inform the mixture parameter for the mixture model.
The integration model in this section differs from the integration model in the Prediction Section only in that the parameter distributions are now additionally informed by the data from Experiment 3. Figures 10 to 12 in the Appendix show how the parameter distributions differ between the prediction and the explanation version of the integration model. Semantic knowledge and speaker informativeness have similar posterior distributions when taking into account all the data compared to when estimating these parameters based only on Experiments 1 and 2. In contrast, the intercept for common ground sensitivity is estimated to be larger and the slope shallower after taking into account Experiment 3 data. That is, our best guess for common ground sensitivity after taking into account the data from Experiment 3 is that younger children are more sensitive to the common ground manipulation and there is less developmental change. The code to run the model can be found in the associated online repository (file: xxxxxxx). Figure 7 shows model predictions for the integration model in comparison to the data from Experiment 3.
Figure 7: Explaining information integration across development. Model predictions based on the integration model. Colored lines show developmental trajectories for each familiar object and condition based on 300 random draws from the model posterior distribution. Top row (blue) shows the congruent condition and the bottom row (red) shows the incongruent condition. Familiar objects are ordered based on their rated age of acquisition (left to right). Dashed black lines show smoothed conditional mean of the data with 95% CI (in grey). Light dots are individual data points.
In the mixture model, the two inferences (common ground and mutual exclusivity) are computed in the same way as in the integration model. Subsequently, they are weighted by the mixture parameter \(\phi\):
\[P^{mixture}_{L_1}(r, \mathcal{L}|u) = \phi \cdot P_{S_1}(u|r_{t}, \mathcal{L}) + (1-\phi) P( \mathcal{L})P(r)\]
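The qualitative difference between multiplicative integration and the additive mixture can be illustrated with a small numeric sketch. The Python code below is a simplification under stated assumptions: it ignores the lexicon \(\mathcal{L}\), treats the speaker likelihood as a given pair of numbers, and normalizes both components over referents before mixing. The function names and all values are hypothetical.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def integration_listener(speaker_likelihood, prior):
    """Multiplicative combination: likelihood times prior, renormalized."""
    return normalize([l * p for l, p in zip(speaker_likelihood, prior)])


def mixture_listener(speaker_likelihood, prior, phi):
    """Additive combination: phi weights mutual exclusivity, (1 - phi) the prior."""
    me = normalize(speaker_likelihood)
    return normalize([phi * m + (1 - phi) * p for m, p in zip(me, prior)])


likelihood = [0.8, 0.2]  # hypothetical: mutual exclusivity favors the novel object
prior = [0.7, 0.3]       # hypothetical: common ground favors it as well (congruent)

print(integration_listener(likelihood, prior)[0])       # exceeds both 0.8 and 0.7
print(mixture_listener(likelihood, prior, phi=0.5)[0])  # lies between 0.7 and 0.8
```

When both cues point to the same referent, the product amplifies them, whereas the weighted sum can never exceed the stronger of the two components. The two accounts therefore make different predictions, particularly in the congruent condition.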
To estimate \(\phi\) as well as the other parameters, we make use of all the available data. The model code can be found in the associated online repository (file: xxxxxxx). The posterior distribution for the mixture parameter \(\phi\) is shown in figure 8A. It suggests that the mutual exclusivity inference is weighted as slightly more important compared to the common ground inference.
For the developmental mixture model, we make the mixture parameter \(\phi\) a function of age and estimate the intercept and slope that yield the best model predictions compared to the data from Experiment 3. Figure 8B visualizes the developmental trajectory of the mixture parameter. Based on this model, the common ground inference seems to decrease in importance compared to the mutual exclusivity inference with age.
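An age-dependent mixture parameter can be sketched as follows. This Python illustration assumes a logistic link, which keeps \(\phi\) between 0 and 1, and age in years without centering; the text does not specify either, and the function name and coefficient values are hypothetical.

```python
import math


def phi_at_age(age, intercept, slope):
    """Mixture weight on the mutual exclusivity inference as a function of age.

    Assumes a logistic link so that the weight stays within (0, 1).
    """
    return 1.0 / (1.0 + math.exp(-(intercept + slope * age)))


# Hypothetical coefficients: a positive slope means that the weight on mutual
# exclusivity increases with age, so the relative weight on common ground decreases.
for age in (2, 3, 4):
    print(age, round(phi_at_age(age, intercept=-1.0, slope=0.5), 3))
```

With a slope of zero, this reduces to the mixture model with a constant \(\phi\), which is why the two mixture models are structurally identical apart from the age dependence.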
Figure 8: Mixture component for the mixture model (A) and the developmental mixture model (B). (A) shows the posterior distribution of the mixture component and (B) shows developmental trajectories for the mixture component based on 300 random draws from the posterior distribution for intercept and slope.
We compared models based on correlations and Bayes factors. Figure 9 shows correlations between model predictions and the data, each binned by year and object. Even though model predictions and data are closely aligned for all models, the integration model shows the highest correlation in all age groups. Next, we directly compared models using Bayes factors (see Table 6). We also included the prediction integration model in this analysis.
Perhaps unsurprisingly, we see that further constraining the parameters for semantic knowledge, speaker informativeness and common ground sensitivity by the data from Experiment 3 greatly improves the model fit (comparison: integration (explanation) vs integration (prediction)). We also see that the explanation integration model provides, by far, the best fit to the data compared to the two mixture models. Interestingly, the prediction integration model also had a better fit than the two mixture models, even though its parameters were not constrained by the data from Experiment 3. When comparing the two mixture models directly, we see that an age-sensitive mixture parameter did not result in a substantially better fit.
This analysis shows that the inference and integration processes described by the integration model accurately capture the data and also explain information integration better than the additive mixture models. As a consequence, we may say that instead of being biased towards one type of inference, children are rationally integrating all the information sources available.
Figure 9: Explaining information integration. Correlations between model predictions and data binned by year, item and condition. Vertical and horizontal error bars show 95% HDI. Blue diamonds show congruent condition and red ones show the incongruent condition.
| Model comparison | Bayes factor |
|---|---|
| integration (explanation) vs integration (prediction) | 128748 |
| integration (explanation) vs mixture | 26028662 |
| integration (explanation) vs developmental mixture | 6280259 |
| integration (prediction) vs mixture | 202 |
| integration (prediction) vs developmental mixture | 49 |
| developmental mixture vs mixture | 4 |
Here we studied how 2- to 5-year-old children integrate semantic and pragmatic information during word learning. In three experiments, we first showed that children make a mutual exclusivity inference and that this inference varied depending on children’s familiarity with the objects involved (experiment 1). Next, we showed that children make common ground inferences based on their interactions with a speaker (experiment 2). When the two inferences were combined, we found that children were sensitive to the way in which they were aligned (experiment 3).
We then introduced a computational model to investigate the process by which children integrated the inferences in experiment 3. As a start, we described mutual exclusivity as a pragmatic inference, which takes in children’s emerging semantic knowledge and their expectations about how informative a speaker is. The integration model assumes that this inference is then flexibly integrated with children’s developing sensitivity to common ground.
Next, we tested the predictive power of this model. That is, we asked how well the model would predict the data of experiment 3, when only knowing the developmental trajectories for mutual exclusivity (based on experiment 1) and common ground (experiment 2). We found a very close alignment of the model predictions and the data across the entire age range. Furthermore, the integration model provided a better fit to the data compared to a number of lesioned models, which selectively omitted one type of information. This suggests that children flexibly integrate all available information.
In the final section, we studied which process best explained children’s information integration. We compared the integration model to mixture models which assumed that children are biased towards one type of inference. We found that the integration model explained the data better than these alternative models. In sum, we found that children’s integration of semantic and pragmatic information during word learning is best described as a form of Bayesian social inference.
In the following, we visualize the model parameters for semantic knowledge, speaker informativeness and common ground sensitivity. Please note that the alternative lesion models presented in the prediction section used the same parameter distributions as the prediction integration model.
Figure 10: Posterior distribution of intercept term for semantic knowledge for each object by model.
Figure 11: Posterior distribution of slope term for semantic knowledge for each object by model.
Figure 12: Posterior distribution of slope and intercept terms for speaker informativeness and sensitivity to common ground by model.
Akhtar, Nameera, Malinda Carpenter, and Michael Tomasello. 1996. “The Role of Discourse Novelty in Early Word Learning.” Child Development 67 (2). Wiley Online Library: 635–45.
Bohn, Manuel, Michael H Tessler, Megan Merrick, and Michael C Frank. 2019. “Predicting Pragmatic Cue Integration in Adults’ and Children’s Inferences About Novel Word Meanings,” September. PsyArXiv. doi:10.31234/osf.io/xma4f.
Bürkner, Paul-Christian. 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. doi:10.18637/jss.v080.i01.
Clark, Eve V. 1987. “The Principle of Contrast: A Constraint on Language Acquisition.” Lawrence Erlbaum Associates, Inc.
Diesendruck, Gil, Lori Markson, Nameera Akhtar, and Ayelet Reudor. 2004. “Two-Year-Olds’ Sensitivity to Speakers’ Intent: An Alternative Account of Samuelson and Smith.” Developmental Science 7 (1). Wiley Online Library: 33–41.
Frank, Michael C, and Noah D Goodman. 2012. “Predicting Pragmatic Reasoning in Language Games.” Science 336 (6084). American Association for the Advancement of Science: 998.
———. 2014. “Inferring Word Meanings by Assuming That Speakers Are Informative.” Cognitive Psychology 75. Elsevier: 80–96.
Frank, Michael C, Elise Sugarman, Alexandra C Horowitz, Molly L Lewis, and Daniel Yurovsky. 2016. “Using Tablets to Collect Data from Young Children.” Journal of Cognition and Development 17 (1). Taylor & Francis: 1–17.
Goodman, Noah D, and Michael C Frank. 2016. “Pragmatic Language Interpretation as Probabilistic Inference.” Trends in Cognitive Sciences 20 (11). Elsevier: 818–29.
Goodman, Noah D, and Andreas Stuhlmüller. 2014. “The design and implementation of probabilistic programming languages.” http://dippl.org.
Kuperman, Victor, Hans Stadthagen-Gonzalez, and Marc Brysbaert. 2012. “Age-of-Acquisition Ratings for 30,000 English Words.” Behavior Research Methods 44 (4). Springer: 978–90.
Lee, Michael D, and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press.
Lewis, Molly L, Veronica Cristiano, Brenden M. Lake, Tammy Kwan, and Michael C Frank. 2020. “The Role of Developmental Change and Linguistic Experience in the Mutual Exclusivity Effect.” Cognition 198: 104191.
Markman, Ellen M, and Gwyn F Wachtel. 1988. “Children’s Use of Mutual Exclusivity to Constrain the Meanings of Words.” Cognitive Psychology 20 (2). Elsevier: 121–57.
McElreath, Richard. 2016. Statistical rethinking: A bayesian course with examples in R and Stan. Texts in Statistical Science. Boca Raton: CRC Press.
Morey, Richard D., and Jeffrey N. Rouder. 2018. BayesFactor: Computation of Bayes Factors for Common Designs. https://CRAN.R-project.org/package=BayesFactor.
Yoon, Erica J, and Michael C Frank. 2019. “The Role of Salience in Young Children’s Processing of Ad Hoc Implicatures.” Journal of Experimental Child Psychology 186. Elsevier: 99–116.